Hardware Sizing

Hardware Sizing (Standalone Install)

Small Tier - 16 Core, 128G RAM (r5.4xlarge / E16s v3)

Component   RAM    Cores
Web         2g     2
Postgres    2g     2
Spark       100g   10
Overhead    10g    2

Medium Tier - 32 Core, 256G RAM (r5.8xlarge / E32s v3)

Component   RAM    Cores
Web         2g     2
Postgres    2g     2
Spark       250g   26
Overhead    10g    2

Large Tier - 64 Core, 512G RAM (r5.16xlarge / E64s v3)

Component   RAM    Cores
Web         4g     3
Postgres    4g     3
Spark       486g   54
Overhead    18g    4

Important: Collibra DQ limits large tier jobs to 2 TB. For DQ jobs that exceed 2 TB, you must reduce the data by filtering columns or rows.
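
One way to read the tiers above is to compare a job's estimated Spark memory against the Spark allocation of each tier and pick the smallest tier that fits. The Python sketch below illustrates that idea. The `pick_tier` helper and the "smallest tier that fits" rule are assumptions made for illustration, not an official Collibra DQ sizing algorithm; only the Spark RAM figures are taken from the tier tables.

```python
# Spark RAM (GB) per standalone tier, copied from the tier tables above.
TIER_SPARK_GB = [("Small", 100), ("Medium", 250), ("Large", 486)]


def pick_tier(spark_gb_needed):
    """Return the smallest tier whose Spark RAM covers the requirement.

    Returns None when no standalone tier is large enough. This is a
    hypothetical rule of thumb, not a Collibra DQ API.
    """
    for name, spark_gb in TIER_SPARK_GB:
        if spark_gb_needed <= spark_gb:
            return name
    return None


print(pick_tier(120))  # Medium
print(pick_tier(600))  # None
```

A result of `None` suggests that the job should be narrowed by filtering columns or rows, or that a cluster deployment is worth considering.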

Estimates

Size for headroom, based on your peak concurrency and peak volume requirements. If concurrency is not a requirement, size for peak volume only (your largest tables). To scan efficiently, scope each job to its critical columns. See Scaling your DQ Job for more information.

Bytes per Cell   Rows            Columns   Gigabytes   Gigabytes for Spark (3x)
16               1,000,000       25        0.4         1.2
16               10,000,000      25        4           12
16               100,000,000     25        40          120
16               1,000,000       50        0.8         2.4
16               10,000,000      50        8           24
16               100,000,000     50        80          240
16               1,000,000       100       1.6         4.8
16               10,000,000      100       16          48
16               100,000,000     100       160         480
16               1,000,000,000   100       1600        4800
16               1,000,000       200       3.2         9.6
16               10,000,000      200       32          96
16               100,000,000     200       320         960
16               1,000,000,000   200       3200        9600
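
The figures in the table follow from simple arithmetic: bytes per cell times rows times columns, converted to gigabytes, then multiplied by 3 for Spark. A minimal Python sketch of that calculation is shown below. The function name `estimate_gb`, its default arguments, and the use of decimal gigabytes (10^9 bytes) are illustrative assumptions chosen to reproduce the table; they are not part of Collibra DQ.

```python
def estimate_gb(rows, columns, bytes_per_cell=16, spark_factor=3):
    """Return (raw_gb, spark_gb) for a table of the given shape.

    Assumes decimal gigabytes (1 GB = 10**9 bytes), which matches the
    figures in the estimates table.
    """
    raw_gb = bytes_per_cell * rows * columns / 1e9
    return raw_gb, raw_gb * spark_factor


raw, spark = estimate_gb(rows=100_000_000, columns=50)
print(f"raw: {raw:g} GB, Spark: {spark:g} GB")  # raw: 80 GB, Spark: 240 GB
```

The same function shows the effect of scoping a job to critical columns: at 100,000,000 rows, going from 200 columns to 25 lowers the Spark estimate from 960 GB to 120 GB.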

Cluster

If your program requires more horsepower or more Spark workers than the example tiers above, which is fairly common in Fortune 500 companies, consider the horizontal and ephemeral scaling of a cluster. Common examples include Amazon EMR and Cloudera CDP. Collibra DQ is built to scale horizontally and can scale to hundreds of nodes.